Mapping the distribution of sandflies and sandfly-associated pathogens in China

Background Understanding and mapping the distribution of sandflies and sandfly-associated pathogens (SAPs) is crucial for guiding the surveillance and control effort. However, their distribution and the related risk burden in China remain poorly understood. Methods We mapped the distribution of sandflies and SAPs using literature data from 1940 to 2022. We also mapped the human visceral leishmaniasis (VL) cases using surveillance data from 2014 to 2018. The ecological drivers of 12 main sandfly species and VL were identified by applying machine learning, and their distribution and risk were predicted in three time periods (2021–2040, 2041–2060, and 2061–2080) under three scenarios of climate and socioeconomic changes. Results In the mainland of China, a total of 47 sandfly species have been reported, with the main 12 species classified into three clusters according to their ecological niches. Additionally, 6 SAPs have been identified, which include two protozoa, two bacteria, and two viruses. The incidence risk of different VL subtypes was closely associated with the distribution risk of specific vectors. The model predictions also revealed a substantial underestimation of the current sandfly distribution and VL risk. The predicted areas affected by the 12 major species of sandflies and the high-risk areas for VL were found to be 37.9–1121.0% and 136.6% larger, respectively, than the observed range in the areas. The future global changes were projected to decrease the risk of mountain-type zoonotic VL (MT-ZVL), but anthroponotic VL (AVL) and desert-type zoonotic VL (DT-ZVL) could remain stable or slightly increase. Conclusions Current field observations underestimate the spatial distributions of main sandfly species and VL in China. More active surveillance and field investigations are needed where high risks are predicted, especially in areas where the future risk of VL is projected to remain high or increase.


S1 Appendix
Supplement to: Mapping the distribution of sandflies and sandflyassociated pathogens in China

Supplementary Materials and Methods
Text A: Data Collection and Management Database of Sandflies and Sandfly-associated Pathogens We included only studies that reported clearly recognizable sandfly species and sandfly-associated pathogens.For ambiguous data, we contacted the authors to obtain the original data or remove them from our database if clarification was not possible.We extracted county information or exact coordinates from each geographic location where sandflies or sandfly-associated pathogens were reported.
For our analyses, each record was defined at the county level, indicated as one specific sandfly species or specific SAP that was reported at least once in the same county (e.g., surveys in different townships within the same county).If presence of sandfly species was reported at a larger scale than the county level, e.g., at the provincial or municipal level, the data were only used for descriptive analysis, while not used for modeling.By applying the same method as above, we also collected data on sandfly-associated pathogens in animals.As a supplementation, we also extracted information on the sandfly-associated pathogens that had been uploaded to the GenBank database using the same criteria and method.

Climate data
We obtained historical and future climate data from the WorldClim website as GeoTiff files with a spatial resolution of 2.5' (http://www.worldclim.org).For the historical data, we generated bioclimatic variables as predictors for our models using the R package "dismo", based on the monthly maximum temperature, monthly minimum temperature, and precipitation data from 1980-2018.These bioclimatic variables have been widely used in ecological studies [1], which can better capture seasonal trends that are related to the physiological constraints of different sandfly species, as compared to traditional meteorological variables.
Due to the high degree of correlation among bioclimatic variables, we conducted a cluster analysis using the R package "NbClust" to analyze pairwise correlations.Specifically, we formed a binary distance matrix in which the distance between any pair of bioclimatic variables was designated as 0 if the absolute value of their correlation coefficient exceeded 0.8 and 1 otherwise.The optimal number of clusters was determined using the Krzanowski and Lai index [2].Based on the average values of bioclimatic variables from 1980-2018, this analysis identified eight distinct clusters (Table E in S1 Appendix).Only one variable from each cluster was used for model-fitting.
We obtained mean values of bioclimatic variables for the periods 2021-2040, 2041-2060, and 2061-2080 from the WorldClim website, respectively.We referenced the performance of General Circulation Models (GCMs) in simulating the regional climatic response to rising Greenhouse Gas (GHG) concentration in China and selected three models: CNRM-CM6-1 (Centre National de Recherches Météorologiques (CNRM) for CIMP6), GISS-E2-1-G (Goddard Institute for Space Studies) and MRI-ESM2-0 (Meteorological Research Institute Earth System Model) [3].Under each scenario, different GCMs generate varying predictions of bioclimatic variables within the region [4].To balance this variability, we calculated the final value of bioclimatic variables as the mean of the predicted values generated by these three GCMs.Terrain data Terrain data were obtained from the Resource and Environment Science and Data Center (RESDC) (http://www.resdc.cn),with a spatial resolution of 250m.To accurately represent topography at the county scale, we additionally calculated the average and standard deviation of elevation, which were used as predictors in our modeling analysis (Table B in S1 Appendix).

Socioeconomic data
Gridded datasets about population projections under Shared Socioeconomic Pathways were obtained from the Science Data Bank, featuring a spatial resolution of 0.5° and a temporal range from 2010 to 2100.Using rural and total population data, we calculated population density, rural population density, and rural population proportion as predictors in our modeling analysis (Table B in S1 Appendix).All these variable measurements were calculated by matching the study time for modeling.The future data for variables at three-time points (2030,2050,2070) were used to predict the future changes.All data were processed into raster form using the R package "terra" and specific values for each variable were extracted at the county level.

Case data
The data were collected without any age or patient-based restrictions.To mitigate the influence of imported cases from within and outside China, we excluded case data from non-endemic provinces.Endemic provinces were defined as those with a history of confirmed local cases of VL in the past 10 years through epidemiological case investigations.Non-endemic provinces were defined as those without records of local distribution of the sandfly vectors for VL transmission or without confirmed local cases of VL in the past 10 years through epidemiological investigations [5,6].

Text B: Ecological modeling and clustering analysis of sandflies Ecological Modeling of sandflies
To explore the relationship between the environmental suitability of sandflies and county-level variable values, we used the boosted regression tree (BRT) methodology.BRT is a predictive learning algorithm that combines traditional regression tree or classification tree with gradient boosting, which effectively models complex response functions while avoiding overfitting.This method is widely used for vector distribution mapping [8,9].
where  is the outcome and  is the vector of predictors for individual observation ,  = 1,2, ⋯ , .  is the number of trees, ℎ( ;  ) returns the terminal node of a tree defined by parameters  (representing both selected predictors and splitting values) for an input  , and  is the expansion coefficient.Starting from some initial estimation of ( ),  ( ), the tree structures and coefficients are determined sequentially by the following optimization for  = 1,2, ⋯ , , where  ( ) =  ( ) +  ℎ(  ;   ), and ( , ) is the loss function.Similar to other ecological models, the BRT model requires defining presence and absence data as response variables.To reduce the sampling bias for the surveyed counties, we established a logistic regression model to predict the sampling probabilities of all the counties with the 13 selected variables used as predictors.The reciprocals of the predicted sampling probabilities of all the surveyed counties were rescaled to have a mean value of 1 and the rescaled values were subsequently used as weights in the BRT models [10][11][12].The response variable was assigned "1" for the counties that had been surveyed and 0 for the unsurveyed.A backward selection procedure was employed to choose the variables at a significance level of 0.05.
Model parameters were determined based on the satisfactory performance observed in our previous study, with a tree complexity of 5, a learning rate of 0.005, and a bagging rate of 75% for the primary analysis [13][14][15].We performed 10-fold cross-validation using the "gbm.step"function available in the R package "dismo" to ascertain the optimal number of trees.To quantify model uncertainty and enhance model robustness, we constructed an ensemble model comprising 100 individual BRT models.
To provide a more robust and parsimonious estimation of model parameters, a two-stage bootstrapping procedure was employed.Within each stage, the following split-and-fit step was iteratively executed for a predetermined number of repetitions.A training set comprising 75% of data points was randomly selected via bootstrapping without replacement, and the remaining 25% served as a test set.The BRT model was constructed using the training set and subsequently applied to the test set for validation when necessary.The output generated by the BRT model includes both predicted environmental suitability and relative contribution (RC) (or influence) of each variable.The RC is calculated based on the frequency with which a variable is selected for splitting and the degree to which each splitting improves the objective function averaging over all trees.The RCs of all variables are normalized such that their summation equals 1 [16].At the first stage, the split-fitting step was iteratively executed ten times to screen for significant variables.At this stage, the trained model was not validated using the test set.All variables with RC <2% from the bootstrap training set were excluded from the next stage.At the second stage, the split-fitting step was repeated 100 times using the remaining variables.Since no variable selection was performed at this stage, all 100 models contained the same variables but produced different contribution estimates.The final RCs of the variables are the average that were calculated from all 100 BRT models.The receiver operating characteristic (ROC) curves and areas under curve (AUC) based on the test sets were also averaged to represent the final predictive performance.The standard deviations and 95% percentiles of the RCs and AUCs across all 100 models were employed to quantify the uncertainty in the estimation.Considering the potential false negative and false positive counties within observed data, partial area AUCs with a tolerance level of 0.2 for omission error were also calculated.For partial area AUCs, the horizontal axis represents the total rate of positives rather than false positives.The ratio of partial AUC to the area under the random selection line (diagonal line) was presented as suggested by Peterson et al [17].We also conducted a sensitivity analysis using a learning rate of 0.01 for selected sandfly species.However, our findings indicated no significant deviation in the contribution estimates.Owing to the substantial volume of data (comprising 14 predictors) and the number of models runs (encompassing 12 sandflies and 100 iterations), we cannot afford a complete cross-validation optimization for all model.BRT modeling was conducted using the R packages "dismo" and "gbm", and predictive performance was assessed using the R packages "ROCR" and "pROC" in the R v4.1.2(R Core Team, 2021).Clustering of sandflies with similar ecological niches and their spatial distribution.We generated features for clustering by the following steps.First, we eliminated predictors that were non-influential, i.e., not remained in final models, for all the 12 sandfly species.Second, we computed three properties for each remaining predictor and each sandfly species: i) the average relative contribution of the predictor in the final 100 BRT models, or zero if the predictor was not in the final models for that sandfly species; ii) the difference of the predictor between case counties (positive for that sandfly species) and all counties, measured by the quartile location (1)(2)(3)(4) of the median value of the predictor among case counties relative to all counties; and iii) the linear correlation between the predictor and the model-predicted presence probabilities of the sandfly species among all counties (averaged over 100 models).Finally, we used these three indicators as features for clustering.Text C: Ecological Modeling of VL Ecological Modeling of VL Among the machine learning methods used in practice, gradient boosted regression tree (GBRT) is one technique that shines in many applications.We utilized the implementation of GBRT available in the R package "xgboost" [18].Besides the same variables as in sandfly species distribution model, we included four sandfly species used as variables to account for the influence of vectors on VL incidence (Table B in S1 Appendix).The screening process for all variables was conducted like in the sandfly BRT model.We matched variable values by year for each county to accurately capture characteristic information.
At the first stage, a logistic extreme gradient boosting (XGBoost) model was employed to fit the presence or absence of VL cases in each county.All counties where human cases were reported during surveillance from 2014-2018 were designated as "presence", and the remaining counties were designated as "absence".The modeling process was first repeated ten times with all predictors included, then predictors with RCs <2% were excluded.Subsequently, the modeling was repeated 100 times, and the final results (estimated RCs and response curves) were averaged over 100 models.Any county with a predicted VL suitability value above the threshold was considered at risk.This stage accounts for the excessive amount of zero case numbers in the majority of the nation.
At the second stage, counties with a non-zero average annual incidence of reported human cases from 2014 to 2018 were fitted with the XGBoost model.This model assumed that VL incidence adheres to a gamma distribution.The gamma distribution was selected due to it best fits the observed non-zero average VL annual incidence.Like the first-stage predictor screening process, we first repeated the modeling five times with all predictors included, then excluded predictors with RCs <2%.The model fitting process was repeated five times, each time with one of the years 2014-2018 as the testing data and the remaining years as the training data.The results (estimated relative contributions and response curves) are averaged over the five models.Two-stage model parameters were set to a learning rate of 0.1, a maximum tree depth of 5, and a bag fraction of 0.8 for the run.The optimal number of trees (referred to as "nrounds" in xgboost) was determined by 5-fold cross-validation.The XGBoost models were developed using the "xgboost" package in R v4.1.2(R Core Team.2021).Supplementary Tables Table A: The specific references for all 47 sandfly species in China from 1940 to 2022.

Fig A:
Fig A: The spatial distribution of the 673 counties with at least one record of sandflies (yellow) from 1940 to 2022, China.Base layers of the maps were downloaded from Resource and Environment Science and Data Center (https://www.resdc.cn/DOI/DOI.aspx?DOIID=120).
Fig A: The spatial distribution of the 673 counties with at least one record of sandflies (yellow) from 1940 to 2022, China.Base layers of the maps were downloaded from Resource and Environment Science and Data Center (https://www.resdc.cn/DOI/DOI.aspx?DOIID=120).

Fig B :
Fig B:The number of records related to sandflies in different provinces in different periods (with the number of publications in parentheses) and a comparison of the total number of records for four VL vector sandflies in each province.

Fig D :
Fig D: The spatial distribution of the sandfly genus Sergentomyia recorded at the county level from 1940 to 2022 in China.Base layers of the maps were downloaded from Resource and Environment Science and Data Center (https://www.resdc.cn/DOI/DOI.aspx?DOIID=120).

Fig E :
Fig E: The spatial distribution of the sandfly genus Chinius and Idiophlebotomus recorded at the county level from 1940 to 2022 in China.Base layers of the maps were downloaded from Resource and Environment Science and Data Center (https://www.resdc.cn/DOI/DOI.aspx?DOIID=120).

FigF:
Fig F: Sandfly species richness (circles) at the prefecture level in seven biogeographic zones in the mainland of China from 1940 to 2022.Base layers of the maps were downloaded from Resource and Environment Science and Data Center (https://www.resdc.cn/DOI/DOI.aspx?DOIID=120).I=Northeast district, II=North China district, III=Inner Mongolia-Xinjiang district, IV= Qinghai-Tibet district, V=Southwest China district, VI=Central China district and VII=South China district.

Fig I :
Fig I:The mean curves (red) and 95% percentiles (gray) for the effects of major predictors (RC ≥5%) on the probability of occurrence of 12 main sandfly species based on the ensemble of BRT models.

Fig
Fig K: XGBoost-model-predicted MT-ZVL incidence in response to major predictors (RC ≥5%) when other predictors are fixed at mean values.The red curves and gray bands show the average and range, respectively, of predicted incidences from five XGBoost models, each with one of the years 2014-2018 as the testing set and the remaining years as the training set.

Fig N :
Fig N: Spatial distribution and changes in model-predicted environmental suitability of P. wui in future under three scenarios.Base layers of the maps were downloaded from Resource and Environment Science and Data Center (https://www.resdc.cn/DOI/DOI.aspx?DOIID=120).

Fig O :
Fig O: Spatial distribution and changes in model-predicted environmental suitability of P. chinensis in future under three scenarios.Base layers of the maps were downloaded from Resource and Environment Science and Data Center (https://www.resdc.cn/DOI/DOI.aspx?DOIID=120).

Fig P :
Fig P: Spatial distribution and changes in model-predicted environmental suitability of P. longiductus in future under three scenarios.Base layers of the maps were downloaded from Resource and Environment Science and Data Center (https://www.resdc.cn/DOI/DOI.aspx?DOIID=120).

Fig Q :
Fig Q: Spatial distribution and changes in model-predicted environmental suitability of P. alexandri in future under three scenarios.Base layers of the maps were downloaded from Resource and Environment Science and Data Center (https://www.resdc.cn/DOI/DOI.aspx?DOIID=120).
values for different time periods separated by slashes.* The relative differences (%) was estimated for the predicted values as compared to those of the previous period, i.e., from actual observations to projection during 2014-2018, from the projection during 2014-2018 to 2021-2040, from the projection during 2021-2040 to 2041-2060, and from the projection during 2041-2060 to 2061-2080, that were intermitted by slash in the parentheses."-" indicates that the predicted values for the counties, areas, or population size affected by VL cases are zero for all periods.VL visceral leishmaniasis; SSP Shared Socioeconomic Pathway; AVL&DT-ZVL anthroponotic VL and desert-type zoonotic VL; MT-ZVL mountain-type zoonotic VL.

Table G :
BRT-model-estimated mean (standard deviation) relative contributions of top factors (RC ≥5%) to the spatial distribution of six most prevalent sandfly species in the Phlebotomus genus .Table H: BRT-model-estimated mean (standard deviation) relative contributions of top factors (RC ≥5%) to the spatial distribution of six most prevalent sandfly species in the Sergentomyia genus Table I: Projections of the numbers, land areas

Table B :
The inclusion and exclusion criteria for screening publications

Table C :
Original resolutions and extents of source datasets

Table D :
Potential risk factors at the county level used in the BRT model for sandfly species and 2-stage XGBoost model for VL

Table E :
Clustering analysis of model predictors at the county level based on pairwise Pearson correlation coefficients .8 are shown, and blank off-diagonal cells all have correlations < 0.8.Predictors grouped to the same cluster are colored the same.From each cluster, only one predictor (marked with a) is chosen for models to avoid multicollinearity.

Table F :
Cross-tabulation of observed and XGBoost-model-predicted annual incidence levels of VL in 2016 The observed and model-predicted levels were deemed consistent if the most of frequencies occurred in diagonal cells.The presented frequencies are the averages over five models, each with one of the years 2014 to 2018 as the testing set and the remaining years as the training set.

Table G :
BRT-model-estimated mean (standard deviation)relative contributions of top factors (RC ≥5%) to the spatial distribution of six most prevalent sandfly species in the Phlebotomus genus

Table H :
BRT-model-estimated mean (standard deviation)relative contributions of top factors (RC ≥5%) to the spatial distribution of six most prevalent sandfly species in the Sergentomyia genus

Table I :
Projections of the numbers, land areas, and population sizes of counties affected by VL risk areas according to SSP126

Table J :
Projections of the numbers, land areas, and population sizes of counties affected by VL risk areas according to SSP245